White Wine Quality Exploration by Rica Enriquez

The white wine quality dataset from Cortez et al. (2009) is explored. The wines are classified as Vinho Verde and are exclusively produced in the demarcated region of Vinho Verde in northwestern Portugal. These wines are described as possessing “vibrant freshness, elegance, lightness and aromatic and flavorful expressions.” The paper and data can be found here:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

The attributes included in the data set are described below (taken from “wineQualityInfo.txt”):

  1. Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  2. Volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
  3. Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
  5. Chlorides: the amount of salt in the wine
  6. Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  7. Total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  11. Alcohol: the percent alcohol content of the wine
  12. Quality: score between 0 and 10

Based on the descriptions of Vinho Verde wines and the attributes in the data set, the features that are predicted to positively contribute to quality are:

  1. Citric acid - adds freshness
  2. Total sulfur dioxide - adds taste

The features that are predicted to negatively contribute to quality are:

  1. Density - subtracts lightness
  2. Volatile acidity - adds an unpleasant, vinegar taste

The following analysis explores the attributes in a systematic manner. The features that mainly influence quality are then further investigated.

The Data Set

## 'data.frame':    4898 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ score               : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol         quality      score   
##  Min.   : 8.00   Min.   :3.000   3:  20  
##  1st Qu.: 9.50   1st Qu.:5.000   4: 163  
##  Median :10.40   Median :6.000   5:1457  
##  Mean   :10.51   Mean   :5.878   6:2198  
##  3rd Qu.:11.40   3rd Qu.:6.000   7: 880  
##  Max.   :14.20   Max.   :9.000   8: 175  
##                                  9:   5

There are 4898 white wines in the dataset with 11 real features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, and “alcohol”). The “X” column is a row index. The analysis of the dataset is centered on how these features are related to the “quality” of a wine. An extra variable, “score”, is an ordered factor version of the “quality” feature. The “quality” scale runs from 0 to 10 (best), but only the seven levels from 3 to 9 occur in the data.
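One way a factor such as “score” could be derived from “quality” is sketched below. The toy `wine` data frame and its values are made up for illustration; only the column names come from the data set.

```r
# Toy stand-in for the wine data: only the "quality" column matters here.
# These values are invented for illustration, not taken from the data set.
wine <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 6L, 4L, 8L, 6L, 5L))

# Ordered factor version of "quality"; the levels are the observed values, sorted.
wine$score <- factor(wine$quality, ordered = TRUE)

levels(wine$score)   # observed quality levels, lowest to highest
table(wine$score)    # number of wines at each level
```

Because the levels are taken from the observed values, a scale that nominally runs from 0 to 10 still yields only as many levels as actually occur.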

Most wines are rated in the middle to mid-high range (5-6), with a median “quality” of 6. Most features appear to have large outliers. “density” and “pH” might be exceptions. These characteristics might be easier to measure accurately than the other features. The median “residual.sugar” content is 5.2 \(g/dm^3\) and the median “alcohol” content is 10.40 vol.%.

Histogram of Wine Qualities

The quality of white wines appears to follow a roughly normal distribution. The scale is from 0 - 10, but the lowest score given was a 3 (20 wines) and the highest was a 9 (5 wines). Are there common profiles for the worst and best wines?

Histograms

Base Histograms of Wine Properties

A set of base histograms is created for all attributes.

The plots above show the distribution of all of the features. The base histograms show that “fixed.acidity”, “citric.acid”, “total.sulfur.dioxide”, “density”, “pH”, and “sulphates” appear roughly normally distributed while “volatile.acidity”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, and “alcohol” have skewed distributions. However, binwidths and axes need adjustment in order to find any unexpected distributions. Histograms of features that provide significantly more information than the base histograms are presented below.
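A minimal sketch of looping over attributes with a per-attribute bin width follows. The helper `attribute_hist` and the bin widths are hypothetical choices, not the ones used for the plots in this report; the ten values per column are the first ten shown in the `str()` output above.

```r
# First ten values of two attributes, copied from the str() output above.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0),
  pH      = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22)
)

# Hypothetical helper: bin one attribute using a fixed bin width.
attribute_hist <- function(data, name, binwidth) {
  x <- data[[name]]
  # Pad the upper end by one bin so the last break always covers max(x).
  breaks <- seq(min(x), max(x) + binwidth, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)
}

# Assumed bin widths, one per attribute; these would be tuned per feature.
binwidths <- c(alcohol = 0.5, pH = 0.1)

# Loop over the attribute names and collect one histogram object per attribute.
hists <- lapply(names(binwidths), function(name) {
  attribute_hist(wine, name, binwidths[[name]])
})
names(hists) <- names(binwidths)

hists$alcohol$counts   # wines per alcohol bin
# plot(hists$alcohol, main = "alcohol") would draw the histogram
```

Keeping the bin widths in a named vector makes it easy to adjust one attribute at a time, which is the kind of tuning the altered histograms rely on.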

Altered Histogram of Citric Acid

There appears to be a ~0.3 \(g/dm^3\) peak and a ~0.5 \(g/dm^3\) spike. It would be interesting to know which wines have a citric acid content ~0.5 \(g/dm^3\).

Altered Histogram of Residual Sugar

There appears to be a bimodal distribution for residual sugar content. There are probably wines for people who prefer drier wines and for others who prefer sweeter wines. It would be interesting to know the properties of these two subsets.

Altered Histogram of Chlorides

There is a long tail of higher chloride concentrations for the lower quality wines.

Altered Histogram of Free Sulfur Dioxide

The lower quality wines tend to have lower free sulfur dioxide concentrations.

Altered Histogram of Alcohol

The higher quality wines tend to have more alcohol content.

Box Plots

Box Plots of Wine Properties, Separated by Wine Quality

A set of box plots is created for all features. The data is limited to the middle 90%. The mean for each category is plotted as an “x”.

The features that appear to vary with wine quality are “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, and “alcohol”. It seems that the variation within a feature is more obvious through this set of box plots than through the previous set of histograms. The top features appear to be “density” and “alcohol”.
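The sketch below shows one way to restrict a feature to its middle 90% and compute the per-quality means. It assumes the limits are the 5th and 95th percentiles of the whole column; the report does not state whether the plots trim rows or only limit the axes. The toy values are made up.

```r
# Made-up toy data: alcohol content for wines at two quality levels.
wine <- data.frame(
  quality = rep(c(5L, 6L), each = 10),
  alcohol = c(8.0, 9.0, 9.2, 9.4, 9.5, 9.6, 9.8, 10.0, 10.2, 13.5,
              8.5, 9.8, 10.0, 10.4, 10.6, 10.8, 11.0, 11.2, 11.4, 14.0)
)

# Assumed definition of "middle 90%": between the 5th and 95th percentiles.
limits <- quantile(wine$alcohol, probs = c(0.05, 0.95))
middle <- wine[wine$alcohol >= limits[["5%"]] & wine$alcohol <= limits[["95%"]], ]

# Mean per quality level, the value that would be drawn as an "x" on each box.
means <- tapply(middle$alcohol, middle$quality, mean)
means

# With a graphics device available, the plot itself could be drawn with:
# boxplot(alcohol ~ quality, data = middle)
# points(seq_along(means), means, pch = 4)
```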

Correlations and Scatterplots

Scatterplots and correlation calculations for the characteristics of wine that might be more closely associated with quality should help ascertain which features might be important to quality. Additionally, how the individual features correlate with each other will be investigated.

Correlation Coefficients and a Pair Plot for Selected Features

##                      volatile.acidity  citric.acid residual.sugar
## volatile.acidity           1.00000000 -0.149471811     0.06428606
## citric.acid               -0.14947181  1.000000000     0.09421162
## residual.sugar             0.06428606  0.094211624     1.00000000
## chlorides                  0.07051157  0.114364448     0.08868454
## free.sulfur.dioxide       -0.09701194  0.094077221     0.29909835
## total.sulfur.dioxide       0.08926050  0.121130798     0.40143931
## density                    0.02711385  0.149502571     0.83896645
## pH                        -0.03191537 -0.163748211    -0.19413345
## alcohol                    0.06771794 -0.075728730    -0.45063122
## quality                   -0.19472297 -0.009209091    -0.09757683
##                        chlorides free.sulfur.dioxide total.sulfur.dioxide
## volatile.acidity      0.07051157       -0.0970119393          0.089260504
## citric.acid           0.11436445        0.0940772210          0.121130798
## residual.sugar        0.08868454        0.2990983537          0.401439311
## chlorides             1.00000000        0.1013923521          0.198910300
## free.sulfur.dioxide   0.10139235        1.0000000000          0.615500965
## total.sulfur.dioxide  0.19891030        0.6155009650          1.000000000
## density               0.25721132        0.2942104109          0.529881324
## pH                   -0.09043946       -0.0006177961          0.002320972
## alcohol              -0.36018871       -0.2501039415         -0.448892102
## quality              -0.20993441        0.0081580671         -0.174737218
##                          density            pH     alcohol      quality
## volatile.acidity      0.02711385 -0.0319153683  0.06771794 -0.194722969
## citric.acid           0.14950257 -0.1637482114 -0.07572873 -0.009209091
## residual.sugar        0.83896645 -0.1941334540 -0.45063122 -0.097576829
## chlorides             0.25721132 -0.0904394560 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.29421041 -0.0006177961 -0.25010394  0.008158067
## total.sulfur.dioxide  0.52988132  0.0023209718 -0.44889210 -0.174737218
## density               1.00000000 -0.0935914935 -0.78013762 -0.307123313
## pH                   -0.09359149  1.0000000000  0.12143210  0.099427246
## alcohol              -0.78013762  0.1214320987  1.00000000  0.435574715
## quality              -0.30712331  0.0994272457  0.43557472  1.000000000

The “alcohol” feature has the strongest correlation with wine “quality” (0.44), and “density” has a moderate negative correlation with it (-0.31). “residual.sugar” presents a bimodal distribution and would not necessarily have a strong linear correlation with “quality” (-0.10). It will be explored further since “density” and “alcohol” correlate strongly with “residual.sugar” (0.84 and -0.45). In fact, the correlation between “density” and “residual.sugar” is the strongest in this set. Similarly, “total.sulfur.dioxide” correlates with “density” (0.53). “alcohol” also correlates negatively with “chlorides” (-0.36) and “total.sulfur.dioxide” (-0.45). The relation of these variables with each other and with “quality” will be analyzed more closely.
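A correlation table like the one above can be produced with `cor()`, which computes Pearson coefficients by default. The sketch below ranks features by the magnitude of their correlation with “quality”; the toy values are made up and only the column names come from the data set.

```r
# Made-up toy data with the same column names as the wine data.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.2, 12.8, 9.2),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9938, 0.9908, 0.9892, 0.9970),
  quality = c(5, 5, 6, 6, 6, 7, 8, 5)
)

# Pairwise Pearson correlation matrix of all columns.
r <- cor(wine)
round(r, 2)

# Correlation of each feature with "quality", sorted by absolute value.
r_quality <- r[, "quality"]
r_quality <- r_quality[names(r_quality) != "quality"]
r_quality[order(abs(r_quality), decreasing = TRUE)]
```

Sorting by absolute value keeps strong negative correlations, such as the one for “density”, next to strong positive ones.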

Middle 90% of Alcohol vs. Density

There is a strong negative linear relationship between “density” and “alcohol”. As “alcohol” increases, “density” decreases. The higher quality wines tend to cluster in the lower right region, which holds the wines with higher “alcohol” content and lower “density”. It would be interesting to see if the density:alcohol ratio is a better feature. Additionally, there appears to be a clear separation between wines with higher “residual.sugar” and lower “residual.sugar”. The higher “residual.sugar” wines tend to have a higher density:alcohol ratio.

Middle 90% of Alcohol vs. Residual Sugar

The “residual.sugar” decreases with increasing “alcohol”. However, there isn’t a strong linear relationship between “residual.sugar” and “alcohol” for the whole range. An exponential decay in “residual.sugar” appears to exist with increasing “alcohol”. The higher “quality” wines tend to have a higher “alcohol” and lower “residual.sugar” content.

Middle 90% of Alcohol vs. Total Sulfur Dioxide

There is a negative linear relationship between “total.sulfur.dioxide” and “alcohol”. As “alcohol” increases, “total.sulfur.dioxide” decreases.

Middle 90% of Alcohol vs. Chlorides

There is a negative linear relationship between “chlorides” and “alcohol”. As “alcohol” increases, “chlorides” decreases.

Middle 90% of Density vs. Residual Sugar

The “residual.sugar” increases with increasing “density”. The relationship between “residual.sugar” and “density” appears to be more of an exponential growth. The higher “quality” wines tend to have a higher “residual.sugar”:“density” ratio.

Middle 90% of Density vs. Total Sulfur Dioxide

There is a linear relationship between “total.sulfur.dioxide” and “density”. As “total.sulfur.dioxide” increases, “density” increases. This plot also shows that higher “quality” wines tend to have lower “density” values.

Correlations of Density:Residual Sugar and Density:Alcohol vs. Quality

The previous scatterplots suggested that the Density:Residual Sugar and Density:Alcohol ratios may be important transformed features. Here, the correlation of the ratios with quality is examined.

The correlation between “density”:“residual.sugar” and “quality” is 0.008996164, which is low and suggests that this ratio is not an important feature for quality. The correlation between “density”:“alcohol” and “quality” is -0.4244115. It is a comparatively strong negative correlation, but the positive correlation between “quality” and “alcohol” (0.4355747) is slightly stronger in magnitude.
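A sketch of building ratio features and checking their correlation with “quality” follows. It assumes each ratio is the first feature divided by the second (for example, density divided by alcohol); the column names `density.per.alcohol` and `density.per.sugar` are hypothetical and the toy values are made up.

```r
# Made-up toy data with the same column names as the wine data.
wine <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.2, 12.8, 9.2),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9938, 0.9908, 0.9892, 0.9970),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 1.5, 2.1, 1.2, 12.0),
  quality        = c(5, 5, 6, 6, 6, 7, 8, 5)
)

# Assumed ratio definitions: numerator / denominator, as the names suggest.
wine$density.per.alcohol <- wine$density / wine$alcohol
wine$density.per.sugar   <- wine$density / wine$residual.sugar

# Compare each ratio, and plain alcohol, against quality.
ratio_cor <- c(
  density.per.alcohol = cor(wine$density.per.alcohol, wine$quality),
  density.per.sugar   = cor(wine$density.per.sugar, wine$quality),
  alcohol             = cor(wine$alcohol, wine$quality)
)
ratio_cor
```

Putting the untransformed feature in the same vector makes it easy to see whether a ratio actually improves on it.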

Group by Quality Plots

Now that the secondary attributes have been investigated, the medians of “alcohol” and “density” at each quality level are examined.

Medians of “alcohol” and “density” values may have linear relationships with “quality”. The “alcohol” medians provide a correlation of 0.8476837 with “quality”. The “density” medians provide a correlation of -0.8885051 with “quality”.
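Group medians of this kind can be computed with `aggregate()`. The sketch below uses made-up toy values and correlates the per-level medians with the quality level itself.

```r
# Made-up toy data with the same column names as the wine data.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.2, 12.8, 9.2),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9938, 0.9908, 0.9892, 0.9970),
  quality = c(5, 5, 6, 6, 6, 7, 8, 5)
)

# One row per quality level, holding the median of each feature.
med <- aggregate(cbind(alcohol, density) ~ quality, data = wine, FUN = median)
med

# Correlation of the group medians with the quality level.
cor(med$quality, med$alcohol)
cor(med$quality, med$density)
```

Because this correlates only one point per quality level, the result is typically much larger in magnitude than the row-level correlation and should be read with that in mind.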

Secondary Citric Acid Peak

The secondary peak is at 0.49 \(g/dm^3\), which is much greater than the average concentration found at every quality level. The citric acid supposedly adds ‘freshness’ and flavor to wines. There are 215 wines with this acidity, and the majority of them are mediocre wines with average sugar content, average alcohol content, and average density (compared to global averages).
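One way to locate such a spike is to tabulate the values in the upper range and pick the most frequent one. In the sketch below the 0.45 cutoff is an assumed threshold, the first ten values come from the `str()` output above, and the remaining values are made up.

```r
# First ten citric acid values from the str() output, plus made-up extras.
citric <- c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16, 0.36, 0.34, 0.43,
            0.49, 0.49, 0.49, 0.49, 0.46, 0.50, 0.52)

# Count how often each value occurs above an assumed cutoff of 0.45.
upper_counts <- table(citric[citric > 0.45])

# The most frequent value in the upper range and the number of wines at it.
peak <- as.numeric(names(upper_counts)[which.max(upper_counts)])
n_peak <- sum(citric == peak)

peak
n_peak
```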

Residual Sugar Subsets

For this part of the analysis, the wines are separated into two sets: wines with a residual sugar content of 4 \(g/dm^3\) or less, and wines with more than 4 \(g/dm^3\) residual sugar.
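The split and the per-subset attribute means can be sketched as follows. The values are the first ten rows shown in the `str()` output above, so the resulting means are illustrative only.

```r
# First ten rows of three attributes, copied from the str() output above.
wine <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0),
  pH             = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22)
)

# Split at 4 g/dm^3 of residual sugar.
low  <- subset(wine, residual.sugar <= 4)
high <- subset(wine, residual.sugar > 4)

# Mean of every attribute within each subset, for comparison with the full set.
colMeans(low)
colMeans(high)
colMeans(wine)
```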

Residual Sugar Subset - 4 \(g/dm^3\) and under

##     volatile.acidity          citric.acid       residual.sugar 
##           0.26782785           0.32875060           1.82293753 
##            chlorides  free.sulfur.dioxide total.sulfur.dioxide 
##           0.04411016          29.90295660         120.12398665 
##              density                   pH              alcohol 
##           0.99182044           3.21308059          11.00906056

The low “residual.sugar” subset has slightly higher correlation magnitudes between “quality” and most other attributes, on average. However, the correlations among the other attributes are generally weaker in this subset than in the full data set. The “residual.sugar” correlations are, surprisingly, much lower in this subset. The average attributes of this subset are also similar to the average attributes of the full data set. The exceptions are lower “residual.sugar”, “free.sulfur.dioxide”, and “total.sulfur.dioxide” values.

Residual Sugar Subset - over 4 \(g/dm^3\)

##     volatile.acidity          citric.acid       residual.sugar 
##           0.28603713           0.33826491           9.81165655 
##            chlorides  free.sulfur.dioxide total.sulfur.dioxide 
##           0.04701678          39.35469475         152.01374509 
##              density                   pH              alcohol 
##           0.99567963           3.16968940          10.14383434

This subset does not appear to be significantly different from the full data set (other than “residual.sugar”) even though it contains only about 57% of the wines. This subset does have slightly higher means in “chlorides” and “free.sulfur.dioxide”.

Worst Wine Analysis

##     volatile.acidity          citric.acid       residual.sugar 
##           0.37598361           0.30770492           4.82103825 
##            chlorides  free.sulfur.dioxide total.sulfur.dioxide 
##           0.05055738          26.63387978         130.23224044 
##              density                   pH              alcohol 
##           0.99434306           3.18338798          10.17349727

The worst wines have higher “volatile.acidity” and “chlorides” and lower “citric.acid”, “residual.sugar”, and “free.sulfur.dioxide” than the average of all wines. The “quality” of the worst wines is more strongly associated with “free.sulfur.dioxide” and “total.sulfur.dioxide”. “alcohol” and “density” hardly correlate with “quality”. Perhaps this is because there are only two levels for “quality” in this subset.

Best Wine Analysis

##     volatile.acidity          citric.acid       residual.sugar 
##           0.27797222           0.32816667           5.62833333 
##            chlorides  free.sulfur.dioxide total.sulfur.dioxide 
##           0.03801111          36.62777778         125.88333333 
##              density                   pH              alcohol 
##           0.99221439           3.22116667          11.65111111

The best wines have lower “chlorides” and more “alcohol”. As with the worst wines, “alcohol” and “density” hardly correlate with “quality”. This is probably because there are only two levels for “quality” in this subset as well.

Note on Not Building a Linear Model

A linear model does not seem appropriate for predicting the quality of a wine since “quality” is a categorical variable. In fact, Cortez et al. (2009) use a support vector machine (SVM) to predict the quality of wine. The ultimate outcome of this analysis is to highlight the main features of Vinho Verde white wine quality.

Final Plots and Summary

Summary

The features that contribute to Vinho Verde white wine “quality” the most are “alcohol” and “density”. “residual.sugar”, “citric.acid”, “chlorides”, and “free.sulfur.dioxide” may be the next most important features. These features are different from the predicted features (“citric.acid”, “total.sulfur.dioxide”, and “volatile.acidity”) that were based on the Vinho Verde description and feature descriptions. The only predicted feature that turned out to be important was “density”. Correlations between the features were examined and interesting subsets were further analyzed.

Plot One

Alcohol content appears to influence wine “quality” positively, if the wine is at least mediocre (quality level of 5 and greater). Generally, as alcohol content increases, wine quality also increases. The top plot shows the range of the alcohol content for each “quality” level and the bottom plot shows how both the median and the mean of alcohol content increase with “quality”.

Plot Two

“density” and “alcohol” have the largest correlations with “quality”, but there is an underlying relationship between “density”, “alcohol”, and “residual.sugar”. The top plot shows how “density” decreases with increasing “alcohol” content. Additionally, there is a clear separation between higher and lower “residual.sugar” content. While “residual.sugar” and “density”:“residual.sugar” did not correlate highly with “quality”, this plot shows that there is a separation in “quality”. Since these three features are highly correlated, perhaps not all three should be included when predicting wine “quality”.

Plot Three

This last plot shows that there are two subsets of wine by sugar content, for people who prefer sweeter wines or drier wines. The analysis showed that, other than sugar content, these subsets were not significantly different from the average of all wines.

Reflection

The most difficult part of the analysis was recognizing that wine “quality” could not be treated as a continuous variable, like diamond price. At first, I was trying to find which feature transformations would lead to the highest correlation with “quality”. Since there are only seven “quality” categories, correlations do not tell the whole story. After this realization, the analysis became easier and I was able to ask interesting questions. I think finding the interesting subsets (the citric acid peak and the residual sugar subsets) will be useful for future analysis and for applying a machine learning algorithm.